
[Train] Simplify single worker training #19814

Merged: 6 commits, Oct 28, 2021

Conversation

@amogkam (Contributor) commented Oct 28, 2021

Currently, Ray Train does not set up the distributed environment (the torch process group, or the TF_CONFIG environment variable) when only 1 worker is used.

However, this requires users to change their training code when scaling from 1 worker to multiple workers, and it has been a source of confusion in our examples:
#19506
#19761

This PR changes the behavior so that the distributed environment is set up regardless of the number of workers. Training functions that use DistributedDataParallel or MultiWorkerMirroredStrategy therefore work unchanged with single-worker Ray Train. This PR also adds tests for the quick start code examples in the docs.

Closes #19761
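The behavior change can be sketched roughly as follows. This is a hypothetical illustration, not Ray Train's actual internals; the function name `should_setup_distributed` is invented for the sketch:

```python
def should_setup_distributed(num_workers: int, after_pr: bool = True) -> bool:
    """Return whether the distributed environment (e.g. the torch process
    group, or the TF_CONFIG env var) should be initialized.

    Hypothetical sketch of this PR's behavior change, not real Ray Train code.
    """
    if after_pr:
        # New behavior: always set up the distributed environment, so code
        # using DistributedDataParallel or MultiWorkerMirroredStrategy runs
        # unchanged with a single worker.
        return True
    # Old behavior: skip setup for a single worker, forcing users to edit
    # their training function when scaling up.
    return num_workers > 1

print(should_setup_distributed(1, after_pr=False))  # False
print(should_setup_distributed(1, after_pr=True))   # True
```

With the old behavior, a single-worker run would skip process-group setup and a `DistributedDataParallel` wrapper would fail; with the new behavior the same training function works at any worker count.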

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@matthewdeng (Contributor) left a comment


This is awesome. Really like this pattern of running the documentation code in CI.

Can you add instructions and link this as an example in [OSS] How to write an example? This is really a best practice I think we should all be following. (Also, be sure to include the trick of moving the function calls under __main__!)
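The `__main__` trick mentioned above can be sketched like this (a hypothetical docs snippet; `train_example` is an invented stand-in): keeping top-level calls under the guard lets a CI test import the module and call the function directly, without the example running on import.

```python
# quick_start_example.py -- sketch of a docs snippet structured for CI testing.

def train_example() -> int:
    # Stand-in for the real quick start training code.
    return sum(range(10))

if __name__ == "__main__":
    # Only runs when executed as a script, not when imported by a test.
    print(train_example())
```

A test can then do `from quick_start_example import train_example` and assert on its result, so the same code serves as both documentation and a regression test.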

@amogkam amogkam merged commit 1803d88 into ray-project:master Oct 28, 2021
@amogkam amogkam deleted the train-single-worker branch October 28, 2021 17:54
Successfully merging this pull request may close these issues.

[Bug] tensorflow_mnist_example fails with 1 worker